Define PCA transform to keep the principal components accounting for 95% of data variance
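A minimal sketch of such a transform with scikit-learn: passing a float to `n_components` keeps the smallest number of components whose explained variance reaches at least that fraction. The `X_cont` input and the scaling step are assumptions (PCA is scale-sensitive, so standardizing first is the usual practice):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_95(X_cont):
    """Scale continuous features, then keep enough PCs for 95% of variance."""
    X_scaled = StandardScaler().fit_transform(X_cont)
    # A float n_components in (0, 1) keeps the smallest number of
    # components whose cumulative explained variance exceeds that fraction
    pca = PCA(n_components=0.95)
    return pca.fit_transform(X_scaled), pca
```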

Prediction Model Data Breakdown

Popular data will be separated from the raw data. Since regression will be used to predict chart rankings of popular songs, all models will be run on popular data only. The datasets used will be all popular data, plus the 5 genre breakdowns.

The following data variants will be created per dataset:

  1. NOCOR
    • Dropping highly correlated variables
      • Will be created and used alongside model testing
  2. NODUR
    • Dropping duration
      • Duration was found to be overly valued during classification, but intuition says it should not be particularly relevant to a song's popularity. It is dropped to observe the differences
  3. RECENT
    • Subset of only most recent 3 years of data
      • Used to see if stronger trends emerge when looking at only recent data. The assumption here is that song trends change fairly frequently, and older data may throw off results when grouped together
  4. PCA
    • PCA performed on the continuous variables, with the dummy variables left unchanged
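A minimal sketch of how the NOCOR variant might be built; the 0.8 correlation threshold and the pairwise drop strategy are assumptions, not values from the notebook:

```python
import pandas as pd

def drop_correlated(df, threshold=0.8):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            # Keep the earlier column of each highly correlated pair
            if corr.iloc[i, j] > threshold and cols[j] not in to_drop:
                to_drop.add(cols[j])
    return df.drop(columns=list(to_drop))
```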

Pull 3 most recent years: 2018, 2019, 2020

Get dummy vars for years

Drop duration

PCA on continuous, with original dummies

Split by genre, create variants per
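The steps above can be sketched as follows; the column names `year`, `duration`, and `genre` are assumptions about the dataset's schema:

```python
import pandas as pd

def make_variants(df):
    """Build the data variants described above from one dataset."""
    df = pd.get_dummies(df, columns=["year"])  # dummy vars for years
    variants = {"all": df}
    # RECENT: keep only the 3 most recent years (2018-2020)
    recent = df[[f"year_{y}" for y in (2018, 2019, 2020)]].any(axis=1)
    variants["recent"] = df[recent]
    # NODUR: drop duration
    variants["nodur"] = df.drop(columns=["duration"])
    return variants

def split_by_genre(df):
    """Split a dataset into one frame per genre, dropping the genre label."""
    return {g: sub.drop(columns=["genre"]) for g, sub in df.groupby("genre")}
```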

There are no notable differences in correlations between the popular dataset and the dataset as a whole.

Linear Regression Model Analysis on Data Subsets and Variants

The remaining portion of the notebook will evaluate linear regression models to predict chart rank for a given popular song. Chart rank will be treated as continuous in this context, so model evaluation will allow some leniency when considering accuracy.

Define 'Model' class: holds a linear regression model object from sklearn, a parameter list to explore, and a grid search object. The class is used to test models more efficiently across changing datasets.
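A hypothetical reconstruction of the class described above; the attribute names, the CV scheme, and the negative-RMSE scoring are assumptions:

```python
from sklearn.model_selection import GridSearchCV

class Model:
    """Bundle an sklearn regressor with its parameter grid and grid search."""

    def __init__(self, name, estimator, param_grid):
        self.name = name
        self.estimator = estimator
        self.param_grid = param_grid
        self.search = None

    def fit(self, X, y, cv=5):
        # Negative RMSE, so higher scores are better per sklearn convention
        self.search = GridSearchCV(
            self.estimator, self.param_grid,
            scoring="neg_root_mean_squared_error", cv=cv,
        )
        self.search.fit(X, y)
        return self.search.best_score_, self.search.best_params_
```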

Functions enable 2D graphing of the alpha parameter for the Ridge and Lasso models, as well as 3D graphing of l1_ratio/alpha for the Stochastic Gradient Descent (SGD) model
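A sketch of the 2D alpha plot, assuming a fitted `GridSearchCV` whose grid varies only `alpha`; the 3D l1_ratio/alpha version would follow the same pattern with a surface plot:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

def plot_alpha_curve(search, title="CV score vs alpha"):
    """2D plot of mean CV score against the alpha grid of a fitted GridSearchCV."""
    alphas = [p["alpha"] for p in search.cv_results_["params"]]
    scores = search.cv_results_["mean_test_score"]
    fig, ax = plt.subplots()
    ax.semilogx(alphas, scores, marker="o")
    ax.set_xlabel("alpha")
    ax.set_ylabel("mean CV score (neg RMSE)")
    ax.set_title(title)
    return fig
```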

Define constructor method to build model listings and begin evaluation

Ridge, Lasso, and SGD models will be built and tested on the previously noted subsets/variations of the data to identify the best predictor possible.

Every model will be run on the full datasets as well as the recent datasets; of the remaining variants, only the non-genre-split versions will be run unless a noticeable improvement is observed in the resulting model, in which case that variant will also be run per genre.
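The evaluation loop can be sketched as follows; the parameter grids and the `rank` target column name are assumptions, not the notebook's actual values:

```python
from sklearn.linear_model import Ridge, Lasso, SGDRegressor
from sklearn.model_selection import GridSearchCV

# Assumed parameter grids for the three model families
MODEL_GRIDS = {
    "ridge": (Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.001, 0.01, 0.1, 1.0]}),
    "sgd": (SGDRegressor(penalty="elasticnet", max_iter=5000),
            {"alpha": [1e-4, 1e-3, 1e-2], "l1_ratio": [0.1, 0.5, 0.9]}),
}

def evaluate_all(datasets, target="rank"):
    """Grid-search each model on each dataset; return best neg-RMSE per pair."""
    results = {}
    for ds_name, df in datasets.items():
        X, y = df.drop(columns=[target]), df[target]
        for model_name, (est, grid) in MODEL_GRIDS.items():
            gs = GridSearchCV(est, grid,
                              scoring="neg_root_mean_squared_error", cv=5)
            gs.fit(X, y)
            results[(ds_name, model_name)] = gs.best_score_
    return results
```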

Training models to predict chart rank with all data

Data Description

All data pulled across all genres is used.

Predictions

Overall performance is expected to be low. It is assumed the data is semi-clustered by genre, and grouping everything together will cause outlier songs to throw the model off. Model performance is expected to be somewhat poor.

All Data

80/20 Train/Test Split
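The split-and-score step used throughout can be sketched as below, with Ridge as a stand-in estimator and the random seed an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def split_fit_rmse(X, y, estimator=None, seed=42):
    """80/20 train/test split, fit, and report test RMSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    est = estimator or Ridge()
    est.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, est.predict(X_te)))
    return est, rmse
```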

All Data Analysis

Not grouping the data has, predictably, produced terrible results. The average RMSE from the models is around 25.5, i.e., 25.5% of the maximum chart ranking. If there are strong trends within the genres, keeping the data together like this would produce poor results, as many songs would be treated as outliers when, within their own genre, they might not be.

All Data - Recent 3 years

80/20 Train/Test Split

All Data - Recent Analysis

The recent 3-years dataset produced an average 3% decrease in accuracy for the final models, although cross-validation accuracies were on average slightly better. As with the all-data run above, if strong trends existed, it is assumed they would be prevalent in the genre breakdowns, and grouping the data together would result in a poorer model.

All Data - No duration

80/20 Train/Test Split

All Data - No Duration Analysis

Duration was removed to see if the classification issues we had with it would show up in our prediction models as well. The average RMSE calculated without duration is not noticeably different from that of the original all-data run, so duration does not seem to be having a large impact one way or the other on our results.

All Data - PCA

80/20 Train/Test Split

All Data - PCA Analysis

PCA was performed to see if the variables had strong linear relations with one another that could be removed, and perhaps help with overfitting. The result is again not distinct from the original all data run.

All Data - No Correlated Variables

80/20 Train/Test Split

All Data - No Correlated Variables Analysis

Again, our models have not improved in any noticeable way after dropping correlated variables. Our 'all' dataset has 54 features; reducing it to anywhere from 53 down to 30 features had no strong impact on accuracy.

Training models to predict chart rank split by genre

Data Description

Data is split by genre - country, pop, latin, R&B, jazz - and each subset is operated on independently of the others.

Predictions

Overall performance is expected to increase. It is assumed the data is semi-clustered by genre, so by splitting it accordingly, fewer outliers will be present. Model performance is expected to be average to good.

All Data (per genre)

80/20 Train/Test Split

All data (per genre) - Analysis

Splitting by genre has had disappointingly low results. The genre breakdown does not seem to have a large effect on overall model accuracy, with the exception of R&B, which is ~4% worse on average than the all-data runs. (Note that jazz and pop only track the top 50 songs, so the lower RMSE values they produce are equally poor once scaled against the 100-rank range of the other genres and the all-data runs.)

All Data (per genre) - Recent 3 years

80/20 Train/Test Split

All Data (per genre) - Recent Analysis

Surprisingly, using only the last 3 years of data has produced significantly worse results than any other data breakdown: models average around 1-2% worse, with country producing a model 6% worse on average than when using all data.

Splitting on the last 3 years of data was done under the assumption that music trends change with time, and that using all historical data could potentially throw off the model. It seems the general trends are stronger when looking at the data as a whole than when looking at just the past 3 years.

Overall Analysis

Regardless of parameters or data, the models produced consistently performed poorly. The best RMSE obtained was around 25% of the range of possible values, regardless of data breakdown or variant, with worse models ending up with over 30%.

Splitting by genre, while initially assumed to be the best option, produced at best results equivalent to no breakdown at all, and at worst the poorest models overall.

The results here agree with most of the clustering and classification results observed; that is to say, there do not appear to be any strong trends between our data and song popularity.